[MLAS] Add an NHWC implementation of convolution to avoid transposes#26834

Open
orlmon01 wants to merge 81 commits into microsoft:main from orlmon01:main

Conversation

@orlmon01
Contributor

  • Modified the CPU EP to specify channels_last when the data format is NHWC
  • Added a FusedNhwcConv kernel
  • Implemented the kernel in MLAS
  • Added compiler guards so it is only used with KleidiAI (for now; can be removed if needed)
  • Added unit tests

Description

Currently ONNX Runtime uses NCHW as the default data layout. For optimisations and kernels that perform better in NHWC layout, or where the data is NHWC in the first place, Transpose ops are inserted around the layers. This patch seeks to eliminate those Transposes for convolutions where they would cause a performance decrease.

Motivation and Context

KleidiAI-specific implementation of this feature. It only supports regular convolutions for now; DepthWise is to follow. As a result, the filter shapes it accepts are currently a little strict.

…transposes

* Modified the CPU EP to specify channels_last when the data format is NHWC
* Added a FusedNhwcConv kernel
* Implemented the kernel in MLAS
* Added compiler guards so it is only used with KleidiAI (for now; can be removed if needed)
* Added unit tests

Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
@orlmon01
Contributor Author

@microsoft-github-policy-service agree company="Arm"

@orlmon01 orlmon01 marked this pull request as draft December 19, 2025 12:34
@orlmon01 orlmon01 marked this pull request as ready for review December 19, 2025 12:35
@orlmon01
Contributor Author

Feedback appreciated as this PR makes quite a lot of changes to the codebase well outside of the normal KleidiAI scope.

Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
@Rohanjames1997
Contributor

Hi @orlmon01, I imagine that avoiding transposes also improves performance.
Do you have any performance results to share?
TIA!

@orlmon01
Contributor Author

orlmon01 commented Jan 6, 2026

Hi @orlmon01, I imagine that avoiding transposes also improves performance. Do you have any performance results to share? TIA!

Hiya,
Sorry, just back after the holidays. Yes, there is a performance increase. It depends on the model. Ones with multiple consecutive Convolutions where transposes can be eliminated will see a larger speedup. Even with the limited range of convolutions it's implemented for there should still be a performance increase in most cases.

I have some numbers somewhere from a Mobilenet model I was using for testing that I'll add in a bit, once I find / regenerate them. :)

@orlmon01
Contributor Author

orlmon01 commented Jan 6, 2026

mobilenet model without the current patch:

Setting intra_op_num_threads to 1
Overriding dimension with name, N, to 1
Overriding dimension with name, T, to 1000
Overriding dimension with name, cache_T_attn, to 32
Overriding dimension with name, right_context, to 5
Session creation time cost: 0.020627 s
First inference time cost: 10 ms
Total inference time cost: 1.33257 s
Total inference requests: 200
Average inference time cost total: 6.662851 ms
Total inference run time: 1.33266 s
Number of inferences per second: 150.075
Avg CPU usage: 16 %
Peak working set size: 85429583872 bytes
Min Latency: 0.006193 s
Max Latency: 0.007625 s
P50 Latency: 0.00666992 s
P90 Latency: 0.00686983 s
P95 Latency: 0.00694425 s
P99 Latency: 0.00733196 s
P999 Latency: 0.007625 s

Same model with changes:

Setting intra_op_num_threads to 1
Overriding dimension with name, N, to 1
Overriding dimension with name, T, to 1000
Overriding dimension with name, cache_T_attn, to 32
Overriding dimension with name, right_context, to 5
Session creation time cost: 0.0217724 s
First inference time cost: 7 ms
Total inference time cost: 1.12897 s
Total inference requests: 200
Average inference time cost total: 5.644857 ms
Total inference run time: 1.12905 s
Number of inferences per second: 177.14
Avg CPU usage: 16 %
Peak working set size: 80362864640 bytes
Min Latency: 0.00527438 s
Max Latency: 0.006706 s
P50 Latency: 0.00566529 s
P90 Latency: 0.00579958 s
P95 Latency: 0.00585429 s
P99 Latency: 0.00639058 s
P999 Latency: 0.006706 s

@edgchen1
Contributor

edgchen1 commented Jan 8, 2026

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

Update to the internal_testing_tests helper macros for file expansion so it works on other platforms
Fix for failing ConvDepthwiseFloat test; allows a small tolerance when running on different hardware
Fix for failing TestSaveAndLoadOrtModel test
Make sure the model being saved / loaded uses a writeable location
Fix for undeclared identifier linker error
@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

Contributor

Copilot AI left a comment


Pull request overview

This PR adds an NHWC (channels-last) implementation of convolution operations to avoid costly transpose operations in the CPU execution provider. The implementation includes KleidiAI-specific optimizations and a fallback path for NHWC convolutions.

Changes:

  • Added NhwcFusedConv kernel for float32 convolutions in NHWC layout (KleidiAI-guarded)
  • Implemented NHWC fast path and fallback path with explicit NHWC↔NCHW conversions in MLAS
  • Extended test infrastructure to resolve paths dynamically and filter NHWC transformers in existing tests

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 7 comments.

Summary per file:

onnxruntime/core/providers/cpu/nn/conv.h: Added channels_last_ flag to Conv kernel
onnxruntime/core/providers/cpu/nn/conv.cc: Implemented NHWC convolution logic with fast path and fallback
onnxruntime/core/optimizer/nhwc_transformer.cc: Added KleidiAI filter and FusedConv sum input handling
onnxruntime/core/mlas/lib/convolve.cpp: Added ChannelsLast parameter to MlasConvPrepare
onnxruntime/core/mlas/inc/mlas.h: Added ChannelsLast field to MLAS_CONV_PARAMETERS
onnxruntime/contrib_ops/cpu/fused_conv.cc: Registered NhwcFusedConv kernel
onnxruntime/test/optimizer/nhwc_transformer_test.cc: Added depthwise convolution test case
onnxruntime/test/optimizer/fuse_initializers_transformer_test.cc: Filtered NhwcTransformer in tests
onnxruntime/test/optimizer/conv_add_act_test.cc: Updated to handle both FusedConv variants
onnxruntime/test/internal_testing_ep/internal_testing_tests.cc: Added path resolution utilities
onnxruntime/test/framework/ort_model_only_test.cc: Added path resolution with diagnostic output
onnxruntime/core/util/math_cpu.cc: Added Im2col instantiation for float NHWC
onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp: Updated to support channels-last input


Comment thread onnxruntime/core/optimizer/nhwc_transformer.cc Outdated
Comment thread onnxruntime/core/framework/kernel_type_str_resolver.cc Outdated
Comment thread onnxruntime/test/framework/ort_model_only_test.cc Outdated
Comment thread onnxruntime/core/optimizer/nhwc_transformer.cc Outdated
Comment thread onnxruntime/test/internal_testing_ep/internal_testing_tests.cc
Comment thread onnxruntime/contrib_ops/cpu/cpu_contrib_kernels.cc
Comment thread onnxruntime/core/providers/cpu/nn/conv.cc Outdated
@hariharans29 hariharans29 changed the title Add an NHWC implementation of convolution to avoid transposes [MLAS] Add an NHWC implementation of convolution to avoid transposes Jan 21, 2026
…ons are 0

Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
…r if needed

Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

No pipelines are associated with this pull request.

@hariharans29 hariharans29 requested a review from Copilot April 29, 2026 17:30
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 35 out of 35 changed files in this pull request and generated 3 comments.



Comment thread onnxruntime/core/optimizer/conv_add_act_fusion.cc
Comment thread onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp
Comment thread onnxruntime/core/providers/cpu/nn/conv.cc
Comment thread onnxruntime/core/framework/kernel_type_str_resolver.cc
* NhwcFusedConv should now be available in minimal builds as the resolver bytes contain it

Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
* RHS hashing now uses the full tensor to ensure uniqueness
* LHS no longer uses hashing as it's unnecessary

Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
@hariharans29
Member

Can you please push the doc change too? I'll kick off CI after that.

Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
@orlmon01
Contributor Author

Can you please push the doc change too? I'll kick off CI after that.

Done. I hadn't seen your message when I was pushing the Co-Pilot fix. :)

@hariharans29
Member

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

No pipelines are associated with this pull request.

Comment thread onnxruntime/core/optimizer/nchwc_transformer.cc Outdated
@hariharans29 hariharans29 requested a review from Copilot April 30, 2026 19:13
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 36 out of 36 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

onnxruntime/test/optimizer/nhwc_transformer_test.cc:1

  • This helper treats stride == 0 as valid (it only rejects < 0) and will propagate it into padding/output-shape computation, which can cause invalid behavior (stride must be > 0 in ONNX Conv). To keep the test-side capability predicate aligned with the production-side checks (and avoid accidental divide-by-zero paths), change the validation here (and the analogous dilations check) to reject <= 0.


Comment thread onnxruntime/core/session/inference_session.cc
Comment thread onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp
Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>
Comment on lines +1058 to +1060
ORT_RETURN_IF_ERROR(
kernel_type_str_resolver_utils::AddLayoutTransformationRequiredOpsToKernelTypeStrResolver(
kernel_type_str_resolver));
Contributor


why do we need to add them to the saved ORT format model? they are already added at load time:

#if !defined(ORT_MINIMAL_BUILD) || defined(ORT_EXTENDED_MINIMAL_BUILD)
ORT_RETURN_IF_ERROR(
kernel_type_str_resolver_utils::AddLayoutTransformationRequiredOpsToKernelTypeStrResolver(
kernel_type_str_resolver));
#endif // !defined(ORT_MINIMAL_BUILD) || defined(ORT_EXTENDED_MINIMAL_BUILD)

#endif // !defined(DISABLE_CONTRIB_OPS)
}

#if !defined(ORT_MINIMAL_BUILD) && !defined(DISABLE_CONTRIB_OPS) && defined(USE_KLEIDIAI)
Contributor


is this specific to USE_KLEIDIAI? NhwcFusedConv is added to the layout transformation ops even if USE_KLEIDIAI is not defined.

ASSERT_FALSE(resolved_args.empty());
}

TEST(KernelTypeStrResolverUtilsTest, SavedOrtModelResolverContainsNhwcFusedConv) {
Contributor


I think the ORT format model itself doesn't need to contain the layout transformation ops if we add them at load time. that decision was originally made to avoid unnecessarily increasing the model size.


6 participants